The performance of over 2000 students in introductory calculus-based electromagnetism (E&M) 
courses at four large research universities was measured using the Brief Electricity and Magnetism 
Assessment (BEMA). Two different curricula were used at these universities: a traditional E&M 
curriculum and the Matter & Interactions (M&I) curriculum. At each university, post-instruction 
BEMA test averages were significantly higher for the M&I curriculum than for the traditional 
curriculum. The differences in post-test averages cannot be explained by differences in variables 
such as pre-instruction BEMA scores, grade point average, or SAT scores. BEMA performance on 
categories of items organized by subtopic was also compared at one of the universities; M&I averages 
were significantly higher in each topic. The results suggest that the M&I curriculum is more effective 
than the traditional curriculum at teaching E&M concepts to students, possibly because the learning 
progression in M&I reorganizes and augments the traditional sequence of topics, for example, by 
increasing early emphasis on the vector field concept and by emphasizing the effects of fields on 
matter at the microscopic level. 

I. INTRODUCTION 

Each year more than 100,000 students take calculus- 
based introductory physics at colleges and universities 
across the US. Such students must obtain a good working 
knowledge of introductory physics because physics con- 
cepts underpin the content of many advanced science and 
engineering courses required for the students' degree pro- 
grams. Unfortunately, many students do not acquire an 
adequate understanding of basic physics from the intro- 
ductory courses; rates of failure and withdrawal in these 
courses are often high, and a large body of research has 
shown that student misconceptions about physics persist 
even after instruction has been completed. In recent
years, there have been significant efforts to reform introductory
physics instruction.

Reforms of the course content (curricula) of introduc- 
tory physics have not progressed as rapidly as reforms 
of content delivery methods (pedagogy). Most students 
are taught introductory physics in a large lecture for- 
mat; the shortcomings of passive delivery of content in 
this venue are well known [5]. A number of pedagogical
modifications that improve student learning
have been introduced and are in widespread use; these
modifications range from increasing active engagement
of students in large lectures (e.g., Peer Instruction and
the use of personal response system "clickers") to reconfiguring
the instructional environment. By contrast,
most students learn introductory physics following
a canon of topics that has remained largely unchanged 
for decades regardless of the textbook edition or authors. 
As a result, the impact of changes in introductory physics
curricula on improving student learning is not well understood.

At many universities and colleges, the introductory 
physics sequence consists of a one semester course with a 
focus on Newtonian mechanics followed by a one semester 
course in E&M. A number of standardized
multiple-choice tests can be used to assess student learning
objectively and efficiently in large classes of
introductory mechanics; some of these instruments have
gained widespread acceptance and have been used to 
gauge the performance of thousands of mechanics stu- 
dents in educational institutions across the U.S. [1]. By 
contrast, fewer such standardized instruments exist for
E&M, and no single E&M assessment test is widely used.
As a result, relatively few measurements of student learning
in large-lecture introductory E&M have been performed.

In this paper we report measurements of the perfor- 
mance of 2537 students in introductory E&M courses 
at four large institutions of higher education: Carnegie
Mellon University (CMU), Georgia Institute of Technology
(GT), North Carolina State University (NCSU), and
Purdue University (Purdue). Two different curricula are
evaluated: a traditional curriculum, which for our purposes
will be defined by a set of similarly organized textbooks
in use during the study [23], and the Matter &
Interactions (M&I) curriculum. M&I differs from the
traditional calculus-based curriculum in its emphasis on 
fundamental physical principles, microscopic models of 
matter, coherence in linking different domains of physics, 
and computer modeling [10]. In particular, M&I
revises the learning progression of the second-semester
introductory electromagnetism course by reorganizing and
augmenting the traditional sequence of topics, for example,
by increasing early emphasis on the vector field
concept and by emphasizing the effects of fields on matter
at the microscopic level [13]. Student performance is
measured using the Brief Electricity and Magnetism Assessment
(BEMA), a 30-item multiple-choice test that
covers basic topics common to both the traditional
and M&I electromagnetism curricula, including
basic electrostatics, circuits, magnetic fields and forces,
and induction. In the design of the BEMA, many
instructors of introductory and advanced E&M courses
were asked to judge draft questions to ensure that questions
included on the test did not favor one curriculum
over another. Moreover, careful evaluation of the BEMA
suggests the test is reliable with adequate discriminatory
power for both traditional and M&I curricula.

The paper is organized as follows: In Section II we
present a summary of BEMA results across the four institutions,
which provides a snapshot of the performance
measurements for students in both the traditional and
M&I curricula. In Sections III-VI we then discuss the detailed
results from each individual institution in turn. In
Section VII we analyze BEMA performance by individual
item and topic, discussing possible reasons for performance
differences, and we make concluding remarks
and outline possible future research directions in Section
VIII.



II. SUMMARY OF COMMON 
CROSS-INSTITUTIONAL TRENDS 

Comparison of student scores on the BEMA at all four 
academic institutions suggests that students in the M&I 
curriculum complete the E&M course with a significantly 
better grasp of E&M fundamentals than students who 
complete E&M studies in a traditional curriculum (Figures
1, 2 & 3). (A description of the methodology used
to define "significance" is given in Appendix A.) Broadly
speaking, the profiles of students at all institutions were
similar; the vast majority of students in both curricula
were engineering and/or natural science majors. During
the term, all students at a given institution were exposed 
to an instructional environment with similar boundary 
conditions on contact hours: large lecture sections that 
met for 2-4 hours per week (depending on the institution) 
in conjunction with smaller laboratory and/or recitation 
sections that typically met for 1-3 hours per week on 
average (again, depending on the institution). We emphasize
that, at a given institution, the contact hours
were, for the most part, very similar for both M&I and
traditional courses (see Sections III-VI). Both the average
BEMA scores (Figure 1) and the BEMA score distributions
(Figure 2) were obtained at all institutions by
administering the BEMA after students completed their 
respective E&M courses. 

A measure of the gain in student understanding as a result
of instruction can be obtained by also administering
the BEMA to students as they enter the course. Specifically,
the average increase in student understanding is
measured by the average percentage gain, $G = O - I$,
where I is the average BEMA percentage score for students
entering an E&M course, and O is the average
end-of-course BEMA percentage score. It has become
customary to report an average normalized gain g,
where $g = G/(100 - I)$ and $(100 - I)$ represents the maximum
possible percentage gain that could be obtained
by a class of students with an average incoming BEMA
score of I. For g reported in Figure 3, the Georgia Tech
and Purdue data are shown only for students who took
the BEMA both upon entering and upon leaving their
E&M course. For the NCSU and CMU students in this
study, I was not measured. In these cases, we estimate g
using measurements of I for other similar student populations
at each institution. (See Sections V and VI for details
on, respectively, the NCSU and CMU estimates.) With
these qualifications, the data (Figure 3) show that, at all four
academic institutions, students receiving instruction
in the M&I curriculum show significantly greater gains in
understanding fundamental topics in E&M than students
who received instruction in a traditional curriculum.
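
As a minimal sketch of the two gain measures defined above, the following Python fragment computes the percentage gain G and the normalized gain g from a pair of class averages. The function names and the example averages are hypothetical and chosen only for illustration; they are not data from this study.

```python
def percentage_gain(pre: float, post: float) -> float:
    """Average percentage gain G = O - I, with both scores in percent."""
    return post - pre


def normalized_gain(pre: float, post: float) -> float:
    """Average normalized gain g = G / (100 - I)."""
    if not 0.0 <= pre < 100.0:
        raise ValueError("pre-test average must lie in [0, 100)")
    return percentage_gain(pre, post) / (100.0 - pre)


# Hypothetical class averages in percent (not values from the study).
I_example, O_example = 25.0, 55.0
G_example = percentage_gain(I_example, O_example)
g_example = normalized_gain(I_example, O_example)
print(f"G = {G_example:.1f} points, g = {g_example:.2f}")
```

Dividing by (100 - I) makes classes with different incoming averages comparable: a class that starts at 25% and ends at 55% has realized 40% of the improvement that was available to it.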

As we will discuss later, students who get A's in the 
course do better on the BEMA than those who get B's, 
who in turn do better than those who get C's. Compari- 
son of average BEMA scores for a given final course grade 
in E&M at CMU, NCSU, and GT suggests that, roughly 
speaking, M&I students perform one letter grade higher 
than students in the traditional-content course. For ex- 
ample, on average an M&I student with a course grade of 
B does as well on the BEMA as the traditional-content 
student with a course grade of A. 

In addition to the common features described here, 
the E&M instructional and assessment efforts contained 
a number of details unique to each academic institution. 
We discuss these details below (Sections III-VI).

FIG. 1: Average post-instruction BEMA scores at four academic
institutions - The average BEMA test scores are shown
for students who have completed a one-semester E&M course 
with either the traditional (TRAD) or Matter & Interactions 
(M&I) curriculum. The number of students tested for each 
curriculum at each institution is indicated in the figure. The 
error bounds represent the 95% confidence intervals on the 
estimate of the average score. 
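
The error bounds quoted in the figures are 95% confidence intervals on an average score. The method actually used is described in Appendix A, which is not reproduced here, so the following is only a sketch of one common textbook approach: a normal-approximation interval of mean ± 1.96 standard errors. The function name and the sample scores are hypothetical.

```python
import math
import statistics


def mean_with_ci95(scores):
    """Return (mean, half_width) of a normal-approximation 95% interval."""
    n = len(scores)
    if n < 2:
        raise ValueError("need at least two scores")
    mean = statistics.fmean(scores)
    # Standard error of the mean, using the sample standard deviation.
    sem = statistics.stdev(scores) / math.sqrt(n)
    return mean, 1.96 * sem


# Hypothetical percentage scores for five students (not study data).
mean, half_width = mean_with_ci95([40.0, 50.0, 60.0, 70.0, 80.0])
print(f"{mean:.1f} +/- {half_width:.1f}")
```

For small samples a Student-t multiplier would be more appropriate than 1.96; the normal approximation is reasonable for sections with on the order of a hundred students.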



III. GEORGIA TECH BEMA RESULTS 

The typical introductory E&M course at Georgia Tech 
is taught with three one-hour lectures per week in large 
lecture sections (150 to 250 students per section) and 
three hours per week in small group (20 student) labora- 
tories and/or recitations. In the traditional curriculum, 
each student attends a two-hour laboratory and, in a sep- 
arate room, a one-hour recitation each week; in the M&I 
curriculum, each student meets once per week in a sin- 
gle room for a single three-hour session involving both 
lab activities (for approximately 2 hours on average) and 
separate recitation activities (for approximately 1 hour 
on average). The student population of the E&M course 
(both traditional and M&I) consists of 83% engineering 
majors and 17% science (including computer science) ma- 
jors. 

Table I summarizes the Georgia Tech BEMA test results
for individual sections. In all traditional and M&I
sections, N_O students in each section took the BEMA
during the last week of class at the completion of the
course, typically during the last lecture or lab session.
Moreover, in the majority of both traditional sections
(T1-T4, T8-T11) and M&I sections (M1, M4 & M5), N_I
students in each section took the BEMA at the beginning
of the course during the first week of class, typically
during the first lecture or lab section. N_I for a given
section is approximately equal to the number of students
enrolled in that section, while N_O is usually smaller than

FIG. 2: Post-instruction BEMA score distributions at four 
academic institutions - The percentage of students with a 
given BEMA test score is plotted for students who have com- 
pleted an E&M course with either a traditional (dot-dashed 
line) or M&I curriculum (solid line) at (a) GT, (b) Purdue, 
(c) NCSU, and (d) CMU. The arrows indicate the location of 
the average score for each distribution. The right-most arrow 
in each subfigure corresponds to the M&I course. The total 
number of students tested in each curriculum at each institution
is the same as in Figure 1. The plots are constructed from
binned data with bin widths equal to approximately 6.7% of 
the maximum possible BEMA score (100%). 



N_I, sometimes substantially so (e.g., T3 and T4), due
to the logistics of administering the test. Thus, in each
section, only those N_m students who took the BEMA
both on entering and on completion of the course are
considered for the purposes of computing both the unnormalized
gain G and the normalized gain g. The BEMA
was administered using the same time limit (45 minutes)
for both traditional and M&I students. M&I students
were given no incentives for taking the BEMA; they were
asked to take the exam seriously and told that the score
on the BEMA would not affect their grade in the course.
Traditional students taking the BEMA were given bonus
credit worth up to a maximum of 0.5% of their final
course score, depending in part on their performance on
the BEMA. A performance incentive for only traditional
students would not be expected to contribute to poorer
performance of traditional students relative to M&I students,
and, therefore, cannot explain the Georgia Tech
differences in performance summarized by Figs. 1 and 2.
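
The restriction to matched students described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' analysis code: the dictionaries keyed by student identifier, the function name, and the scores are all hypothetical.

```python
def matched_gains(pre_scores, post_scores):
    """Compute N_m, G, and g using only students present in both tests.

    pre_scores and post_scores map a (hypothetical) student identifier
    to a BEMA percentage score.
    """
    matched = sorted(set(pre_scores) & set(post_scores))
    if not matched:
        raise ValueError("no student took both tests")
    n_m = len(matched)
    avg_pre = sum(pre_scores[s] for s in matched) / n_m
    avg_post = sum(post_scores[s] for s in matched) / n_m
    gain = avg_post - avg_pre
    return n_m, gain, gain / (100.0 - avg_pre)


# Made-up scores: "s3" missed the post-test and "s4" missed the pre-test,
# so neither contributes to the matched averages.
pre = {"s1": 20.0, "s2": 30.0, "s3": 40.0}
post = {"s1": 50.0, "s2": 60.0, "s4": 90.0}
n_m, G, g = matched_gains(pre, post)
print(n_m, G, round(g, 3))
```

Using only matched students keeps the pre-test and post-test averages on the same population, so that students who dropped the course or missed one administration do not bias the gain.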

Figure 4(a) demonstrates there was no significant difference
between traditional and M&I students in the distribution
of pre-test scores on the BEMA. The average
pre-test score for all sections ranged from about 22% to
28%; a section-by-section comparison suggests there is
no significant difference in pre-test scores on the BEMA



FIG. 3: Gain in student understanding of E&M at four academic
institutions - The increase in student understanding
resulting from a one-semester traditional (TRAD) or Matter
& Interactions (M&I) course is measured using the average
normalized gain g. The number of students tested for
each curriculum at each institution is: GT M&I: N = 297,
GT Trad.: N = 887, Purdue M&I: N = 76, Purdue Trad.:
N = 79, NCSU M&I: N = 79, NCSU Trad.: N = 48, CMU
M&I: N = 73, CMU Trad.: N = 116. The error bounds represent
the 95% confidence intervals on the estimate of the normalized
gain. The estimates of g require the average BEMA
scores for incoming students, I; for the NCSU and CMU results,
I was computed differently than for the GT and Purdue
results. (See Sections V and VI for details.)



between individual sections. (See Table I and Appendix
A.) As an additional check on student populations in
the two curricula, we examined the students' grade point 
averages at the start of the E&M courses; no significant 
difference in incoming GPA was found. Thus, the student
population entering both courses is essentially the
same. Additionally, because the BEMA pre-test averages 
and the distribution of BEMA pre-test scores are essen- 
tially the same for the GT students in both curricula, we 
focus our remaining discussion on the post-test scores. 

Figure 2(a) indicates the distribution of the BEMA
post-test scores for the M&I group is significantly different
from the distribution for the traditional group. Moreover,
the BEMA post-test averages for each section (Figure
5) suggest the M&I sections consistently outperform
the traditional sections. The M&I BEMA averages across
four different instructors are relatively consistent, while
the BEMA averages of the traditional sections across five
different instructors vary greatly. The use of Personal
Response System (PRS) "clicker" questions may account
for some of this difference. The lowest scoring sections
(T1, T2 and T5 in Figure 5) did not use clicker questions;
by contrast, approximately 2-6 clicker questions



TABLE I: Georgia Tech BEMA test results are shown for
five Matter & Interactions sections (M1-M5) and eleven traditional
sections (T1-T11). Different lecturers are distinguished
by a unique letter in column L. (Note that lecturer B in M3
was assisted by lecturer A.) The average BEMA score O for
N_O students completing the course is shown for all sections.
Moreover, in those sections where data are available, the average
BEMA score I for N_I students entering the course is
indicated. N_m is the number of students in a given section
who took the BEMA both at the beginning and at the end
of their E&M course. GPA is the incoming cumulative grade
point average for students in a given section.

ID    L    N_I    I (%)         N_O    O (%)         N_m    GPA
M1    A    43     24.5 ± 2.3    40     59.8 ± 4.8    40     2.96 ± 0.18
M2    A    n/a    n/a           149    59.7 ± 2.8    n/a    2.99 ± 0.10
M3    B    n/a    n/a           146    57.4 ± 2.6    n/a    n/a
M4    C    138    27.7 ± 1.9    138    59.5 ± 2.7    132    3.14 ± 0.10
M5    D    140    24.7 ± 1.4    139    55.9 ± 2.9    131    3.07 ± 0.09
T1    E    231    22.9 ± 1.2    204    41.2 ± 1.9    180    3.10 ± 0.07
T2    E    219    22.9 ± 1.3    195    40.7 ± 1.9    176    2.99 ± 0.08
T3    F    203    25.7 ± 1.4    136    51.9 ± 3.0    130    3.01 ± 0.09
T4    F    212    25.1 ± 1.4    144    50.8 ± 2.5    133    2.98 ± 0.09
T5    E    n/a    n/a           144    38.3 ± 2.5    n/a    3.09 ± 0.08
T6    G    n/a    n/a           29     45.2 ± 6.5    n/a    2.98 ± 0.12
T7    G    n/a    n/a           36     44.5 ± 4.9    n/a    2.81 ± 0.12
T8    H    87     28.1 ± 2.0    73     54.8 ± 4.7    59     2.97 ± 0.13
T9    ?    112    26.5 ± 2.1    84     51.6 ± 3.7    75     2.94 ± 0.11
T10   ?    128    25.3 ± 1.6    103    50.3 ± 3.0    88     3.04 ± 0.09
T11   ?    127    25.8 ± 1.8    98     49.5 ± 3.3    82     3.03 ± 0.10

FIG. 4: Pre-test BEMA score distributions for Georgia Tech
and Purdue - The distributions of BEMA test scores for students
before completing an E&M course with either a traditional
(dot-dashed line) or M&I curriculum (solid line) are
shown for data from (a) GT (N = 1319 for traditional stu- 
dents, N = 321 for M&I students) and (b) Purdue (N = 78 
for traditional students, N = 76 for M&I students). The plots 
are constructed from binned data with bin widths equal to 
approximately 6.7% of the maximum possible BEMA score 
(100%). 



were asked per lecture in all M&I sections and all other
traditional sections. Nevertheless, even when the comparison
between sections is restricted to the traditional
sections with the highest average BEMA scores (Sections
T3, T4, T8 and T9, which were taught by three different
instructors who have a reputation for excellent teaching),
the M&I sections demonstrated significantly better performance
(Appendix A).

FIG. 5: Average BEMA scores by section at Georgia Tech - The average end-of-semester BEMA scores for 11 traditional (T#)
and 5 M&I (M#) sections at GT are shown. The error bounds indicate 95% confidence intervals on the estimates of the average
for each section. The number of students tested in a particular section is given by N_O in Table I.

FIG. 6: The average post-test BEMA score of all students
receiving a particular final course grade in introductory E&M
at GT is shown. The error bounds indicate 95% confidence
intervals on the estimates of the average for each grade. The
number of students for whom grades were obtained is N =
1233 for traditional students and N = 611 for M&I students.

The data in Figure 6 suggest a correlation between
BEMA scores and final course grade at GT, with M&I
students outperforming traditional students with the
same final letter grade. Our finding that BEMA scores
correlate strongly with final letter grade is not obvious.
It seemed possible that the course grade was determined
to a significant extent by the students' ability to work difficult
multistep problems on exams, whereas the BEMA
primarily measures basic concepts which, it was hoped,
all students would have mastered. However, we find M&I
students exhibit a one-letter-grade performance improvement
as compared with traditional students; specifically,
the average BEMA scores are statistically equivalent
between traditional A students and M&I B students,
traditional B students and M&I C students, and traditional
C students and M&I D students. This difference
in performance cannot be attributed to differences in the 
distribution of final grades; the percentage of students 
receiving a given final grade in the M&I sections (27.7% 
As, 37.8% Bs, 25.2% Cs, 7.2% Ds, and 2.1% Fs) is similar 
to that in the traditional sections (29.8% As, 34.4% Bs, 
24.3% Cs, 8.8% Ds, and 2.7% Fs). 
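
To make the similarity of the two grade distributions concrete, the short sketch below tabulates the per-grade differences using the percentages quoted above. The choice of summary (the largest absolute difference in percentage points) is ours for illustration and is not a statistical test used in the study.

```python
GRADES = ("A", "B", "C", "D", "F")

# Percentages of final grades at GT, as quoted in the text.
mi_percent = dict(zip(GRADES, (27.7, 37.8, 25.2, 7.2, 2.1)))
trad_percent = dict(zip(GRADES, (29.8, 34.4, 24.3, 8.8, 2.7)))

# Difference (M&I minus traditional) in percentage points for each grade.
differences = {g: mi_percent[g] - trad_percent[g] for g in GRADES}
largest_grade = max(GRADES, key=lambda g: abs(differences[g]))
print(largest_grade, round(differences[largest_grade], 1))
```

No grade category differs by more than a few percentage points between the two curricula, which is consistent with the statement that the grade distributions are similar.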



IV. PURDUE BEMA RESULTS 

The curriculum comparison at Purdue focuses on an 
introductory E&M course taught to electrical and com- 
puter engineering majors. The contact time was allo- 
cated somewhat differently for students in each curricu- 
lum; however, the total course contact time was simi- 
lar for both traditional and M&I students. Each week, 
traditional students met for three 50-minute large lec- 
tures (approximately 100 students per section) and two 
50-minute small-group recitations (25-30 students); these 
students did not attend a laboratory. M&I students met 
for two 50-minute lectures per week in large lecture sec- 
tions (approximately 100 students per section) and two 
hours per week in small group (25-30 students) laboratories.
In addition, M&I students attended a small group
(25-30 students) recitation once a week for 50 minutes. 
In all traditional and M&I sections, students in each sec- 
tion took the BEMA during the last week of class at the 
completion of the course, typically during the last lec- 
ture or lab session. Moreover, students in each section 
took the BEMA at the beginning of the course during 
the first week of class, typically during the first lecture 
or lab section. 

Figure 2(b) indicates M&I students significantly outperformed
traditional students at Purdue. Students in
both courses took the BEMA during a portion of a lab 
period with a 45-minute time limit for completion. Both 
traditional and M&I students took the assessment (both
pre and post) in the same week. The "initial state" of the
two groups upon entering their respective E&M course
was measured by comparison of the grade point averages
between the two classes; no significant difference was
found [30]. Additionally, comparison of the distributions
of the BEMA score upon entrance to the course shows
only a small difference between the two groups (Figure
4(b)) that cannot account for the large post-test difference
shown in Figure 2(b).



V. NORTH CAROLINA STATE BEMA 
RESULTS 

The introductory E&M course at NC State is typically 
taught with three one-hour lectures per week in large 
lecture sections (about 80 students per section). (Note,
however, that one M&I section was taught in the SCALE-UP
studio format.) In the traditional curriculum, each
student attended a two-hour laboratory every two weeks;
in the M&I curriculum, each student attended a two-hour
laboratory every week. Approximately three-fourths of
the student population of the E&M course (both tradi- 
tional and M&I) are engineering majors. 

One hundred twenty-seven volunteers were recruited 
from eight different sections (700 students total) by 
means of an in-class presentation made by a physics edu- 
cation research graduate student. Students were paid $15 
for their participation in this out-of-class study. Prior to 
participation, students were told that they did not need 
to study for the test. Just before the end of the semester, 
several testing times were scheduled to accommodate stu- 
dent schedules. The test was given in a classroom con- 
taining one computer per student, with a proctor present; 
each student took the test using an online homework system.
Each student took the test independently with a
60-minute time limit.







FIG. 7: Average BEMA score by section at NCSU - The 
average end-of-semester BEMA scores for 4 traditional (T#)
and 4 M&I (M#) sections at NCSU are shown. The error
bars indicate 95% confidence intervals on the estimates of the
average for each section. The numbers of students tested are:
N = 7 for T1, N = 10 for T2, N = 16 for T3, N = 15 for T4,
N = 16 for M1, N = 22 for M2, N = 10 for M3 and N = 31
for M4. Note that section M4 was taught in the SCALE-UP 
studio format. 



The difference in BEMA averages (shown in Figure 
[1]) between the M&I group and the traditional group is 
large and statistically significant as determined by the 
method outlined in Appendix A. Because students were
recruited from eight different sections, it is of interest 
to observe how students from each section performed on 
the BEMA. Figure [7] shows the average scores of the in- 
dividual sections for both M&I and traditional groups. 
Results of statistical tests (namely, the Kruskal-Wallis 
test) show that there was no significant difference in
BEMA scores among the M&I sections; similarly, no sig- 
nificant difference across the four traditional sections was 
detected. These results suggest that within each group 
students' BEMA scores were statistically uniform, and 
that the better performance of the M&I students was 
not due to a few outlier sections that could have biased 
the results. 
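
For readers unfamiliar with the Kruskal-Wallis test mentioned above, the following pure-Python sketch computes its H statistic from pooled ranks. It omits the correction for ties and does not compute a p-value; the function names and the toy scores are hypothetical and are not the NCSU data.

```python
def average_ranks(values):
    """Rank values from 1..N, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while (end + 1 < len(order)
               and values[order[end + 1]] == values[order[start]]):
            end += 1
        shared = (start + end) / 2 + 1  # mean of 1-based positions start..end
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def kruskal_wallis_h(groups):
    """Kruskal-Wallis H statistic, without the correction for ties."""
    pooled = [x for grp in groups for x in grp]
    n_total = len(pooled)
    ranks = average_ranks(pooled)
    weighted, offset = 0.0, 0
    for grp in groups:
        rank_sum = sum(ranks[offset:offset + len(grp)])
        weighted += rank_sum ** 2 / len(grp)
        offset += len(grp)
    return 12.0 / (n_total * (n_total + 1)) * weighted - 3.0 * (n_total + 1)


# Toy scores for three hypothetical sections.
h_stat = kruskal_wallis_h([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print(round(h_stat, 2))
```

Under the null hypothesis that all sections share the same distribution, H is approximately chi-square distributed with (number of groups - 1) degrees of freedom, so a small H indicates no detectable difference among sections. The approximation is rough for groups as small as those in this toy example.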

FIG. 8: The average post-test BEMA score of all students 
receiving a particular letter grade in introductory E&M at 
NCSU is shown. The error bounds indicate 95% confidence 
intervals on the estimates of the average for each grade. The 
number of students for whom grades were obtained is N = 48 
for traditional students and N = 79 for M&I students. 



One possible explanation of the results may be a re- 
cruitment bias; that is, higher-performing M&I students 
and lower-achieving traditional students may have been 
recruited for the study. To rule this out, participants' 
GPA, SAT scores as well as math and physics course 
grades (prior to taking the E&M course) were examined. 
The two math courses from which students' grades were 
collected were the first and second semester of calculus 
courses; the physics course for which students' grades 
were collected was the calculus-based mechanics course. 
Using the method described in Appendix A, we found
that there was no significant difference between the M&I 
group and traditional group in any of these grades. Additionally,
no significant difference was found in the SAT
scores (verbal and math scores). These results suggest
that the recruitment was not biased and that student
participants from both the M&I sections and traditional
sections had similar academic backgrounds [31].

In the NCSU study, students were not given the BEMA 
prior to the start of their E&M course. However, a number
of students from the same population, who were concurrently
enrolled in introductory mechanics, did take
the BEMA via a web-based delivery system. The
average BEMA score of the mechanics students was 23%.
We use this value as an estimate for I to compute
the normalized gains shown in Figure 3, which shows superior
gain by M&I students.

The data in Figure 8 suggest a correlation between
BEMA scores and final course grade at NCSU, with
M&I students outperforming traditional students with
the same final letter grade. Moreover, we find M&I students
exhibit a one-letter-grade performance improvement;
specifically, the average BEMA scores are statistically
equivalent between traditional A students and M&I
B students. Such a performance difference might arise if
fewer high final grades were awarded in M&I than in the 
traditional course; under these circumstances, the A stu- 
dents in M&I would be more select and, perhaps, better 
than A students in the traditional course. In fact, how- 
ever, a somewhat larger percentage of higher final grades 
were earned in the M&I sections (40.5% As, 43.0% Bs, 
12.7% Cs, 2.5% Ds, & 1.3% Fs) than in the traditional 
sections (25.0% As, 54.2% Bs, 18.7% Cs, 2.1% Ds, & 0.0% 
Fs). Thus, the difference in performance on the BEMA 
cannot be attributed to differences in the distribution of 
final grades. 



VI. CARNEGIE MELLON RETENTION STUDY 

The introductory E&M course at Carnegie Mellon consisted
of a large (~150 students) lecture that met three
hours per week and a recitation section that met two 
hours per week; there was no laboratory component to 
this course. For historical reasons, the course was sepa- 
rated into two versions: one for engineering majors that 
used the traditional curriculum and one for natural and 
computer science majors that used the M&I curriculum. 
The pedagogical aspects of both the traditional and M&I 
courses were quite similar. 

To probe the retention of E&M concepts as a function 
of time, two groups of students were recruited from each 
curriculum: (1) Recent students of introductory E&M, 
i.e., students who had taken the introductory E&M fi- 
nal exam 11 weeks prior to BEMA testing, and (2) "old" 
students of introductory E&M, who had completed in- 
troductory E&M anywhere from 26 to 115 weeks prior to 
BEMA testing. A total of 189 students volunteered for 
the study out of a pool of 1200 CMU students who had 
completed introductory E&M at CMU and who were sent 
a recruitment email by a staff person outside the physics 
department. With a promise of a $10 honorarium, the 



email asked for volunteers to take a retention test on 
an unspecified subject and stated that the test's pur- 
pose was to contribute to improvement in introductory 
courses. The student volunteers took the BEMA during 
the evening in a separate proctored classroom. Just be- 
fore taking the test, students were again told that they 
could help improve instruction at CMU by participat- 
ing and doing their best; a poll of the students indicated, 
with one exception, that the volunteers arrived at the ex- 
amination room without knowledge of the test's subject 
matter. No pre-test was given to the students; however,
an estimate of I, the average BEMA score prior to entering
the E&M course, was obtained by a separate study.
To obtain this estimate, a different group of volunteers was
drawn from the appropriate pool of potential students
for each curriculum, i.e., engineering students who had
not yet taken the traditional E&M course and science
students who had not yet taken the M&I E&M course.
These volunteers were given the BEMA; we estimate I
= 28% (N = 14) for the traditional courses and I = 23%
(N = 10) for M&I.

Disregarding the length of time since completing the 
E&M course, it was found that the average BEMA score 
O = 41.6% for students in the traditional curriculum is 
significantly lower than the O = 55.6% for students in the 
M&I curriculum. The participants from each course were 
not significantly different in background as measured by 
the average SAT verbal or math score. 
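
As a worked illustration of the normalized gain $g = (O - I)/(100 - I)$ applied to the CMU numbers quoted above, and assuming that the pooled post-test averages (all retention intervals combined) are paired with the separately estimated pre-test averages, one obtains approximately the following. The values plotted in Figure 3 may have been computed from a different subset of students, so these figures are indicative only.

```latex
\begin{align*}
g_{\mathrm{TRAD}} &= \frac{O - I}{100 - I} = \frac{41.6 - 28}{100 - 28}
                   = \frac{13.6}{72} \approx 0.19, \\
g_{\mathrm{M\&I}} &= \frac{O - I}{100 - I} = \frac{55.6 - 23}{100 - 23}
                   = \frac{32.6}{77} \approx 0.42.
\end{align*}
```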

FIG. 9: Retention of E&M knowledge - Average scores on 
the BEMA vs time since completion of a course in introduc- 
tory E&M are shown for students at CMU from either a tra- 
ditional curriculum (TRAD) or the Matter and Interactions 
(M&I) curriculum. The error bounds indicate 95% confidence 
intervals on the estimates of the average for each section. The 
numbers of students tested were 116 for traditional and 73 for 
M&I. 

Figure 9 shows that E&M knowledge as measured
by the BEMA showed a significant loss over the re-
tention period for both M&I and traditional students.
While the M&I groups showed greater absolute reten- 
tion at all grade levels than the traditional groups, the 
BEMA performances of students who most recently com- 
pleted the E&M course were also greater in the M&I 
group. The rate of loss in the two groups appeared to 
be the same, a result typically found in the experimen- 
tal analysis of retention when comparing different initial 
"degrees of learning" [H, [13 . Thus, as measured by 
BEMA performance we could not determine unequivo- 
cally that M&I improved retention of E&M knowledge 
over the traditional course beyond effects due to initial 
differences in performance on the BEMA. It's worth not- 
ing here that recent work has shown that better reten- 
tion occurs for students exposed to improved pedagogical 
techniques [l^. 
FIG. 10: The average BEMA score for students receiving a
particular final grade in introductory E&M at CMU is shown. 
In all cases, the BEMA test was administered 11 weeks after 
the completion of the course. The error bounds indicate 95% 
confidence intervals on the estimates of the average for each 
grade. The number of students for which grades were ob- 
tained is N = 14 for traditional students and N = 32 for M&I 
students. 

The data in Figure 10 suggest a correlation between
BEMA scores and final course grade at CMU, with 
M&I students outperforming traditional students with 
the same final letter grade. Moreover, we find M&I 
students exhibit a one-letter-grade performance improve- 
ment; specifically, the average BEMA scores are statisti- 
cally equivalent between traditional A students and M&I 
B students. Such a performance difference might arise if 
fewer high final grades were awarded in M&I than in the 
traditional course; under these circumstances, the A stu- 
dents in M&I would be more select and, perhaps, better 
than A students in the traditional course. In fact, how-
ever, a somewhat larger percentage of higher final grades
were earned in the M&I sections (34.3% As, 39.7% Bs, 
21.9% Cs, 4.1% Ds, and 0% Fs) than in the traditional 
sections (25.0% As, 37.9% Bs, 31.9% Cs, 5.2% Ds, and 
0.0% Fs). Thus, the difference in performance on the 
BEMA cannot be attributed to differences in the distri- 
bution of final grades. 



VII. ITEM ANALYSIS OF THE BEMA 

We have seen superior performance on the BEMA from 
M&I introductory E&M classes as compared to tradi- 
tional E&M classes across multiple institutions. One 
question that arises is whether this result can be ex- 
plained by M&I students performing better in any one 
topic or set of topics in the E&M curriculum. Because 
the content of the BEMA spans a broad range of topics, 
we can examine this question by dividing the individ- 
ual BEMA items into different categories and comparing 
M&I and traditional course performance in the individ-
ual categories. There are some subjective decisions to be
made when categorizing the items based on content and 
concepts, including the number of categories, the partic- 
ular concepts they encompass, and which items belong 
to which categories. Furthermore, certain items may in- 
volve more than one concept and could potentially fall 
into more than one category. We decided, for simplicity, 
to group the BEMA items into just four categories cover- 
ing different broad topics, namely, electrostatics, DC cir- 
cuits, magnetostatics, and Faraday's Law of Induction. 
Each item was placed into one and only one category; re-
fer to Figure 11 for the items that comprise each category
[33]. Note that this is an a priori categorization based
on physics experts' judgment of the concepts covered by 
the items; it is not the result of internal correlations or 
factor analysis based on student data. Using these cat- 
egories, we compared M&I and traditional performance 
in each category. We chose to analyze the data from 
GT only, because we had the largest amount of data for 
traditional and M&I courses across a range of different 
lecture sections from this institution. 

We define the difference in performance between the
two curricula as $\Delta G = G_M - G_T$, where $G_M$ and $G_T$
are the (unnormalized) gains for the M&I and tradi-
tional curricula, respectively. In the same way, we can
determine $\Delta G_i$, the difference in performance on the $i$th
BEMA question; $\Delta G_i$ is equal to the percentage of M&I
students that answered the $i$th question correctly minus
the percentage of traditional students that answered the
same question correctly. Using these quantities, we define

$$\Delta g_i \equiv \frac{1}{N} \frac{\Delta G_i}{\Delta G},$$

the fractional difference in performance for the $i$th
question, where $N$ is the number of test questions.
$\Delta g_i$ can be thought of as the fractional contri-
bution of the $i$th question to $\Delta G$ since

$$\sum_{i=1}^{N} \Delta g_i = 1.$$

For equal weighting in the BEMA score (the scoring
method that we used), a given question will make an
"average" contribution to $\Delta G$ when the magnitude of
the fractional difference $\Delta g_i$ is approximately equal to the inverse of the num-
ber of test questions (0.033 for the 30-question BEMA).
Thus, when the magnitude of $\Delta g_i$ is significantly greater
than 0.033, the corresponding question yields a greater
than average contribution to $\Delta G$. In addition, the sign
of $\Delta g_i$ is noteworthy; a positive (negative) $\Delta g_i$ corre-
sponds to an item where on average the M&I students
scored higher (lower) than traditional students. (This
presumes $\Delta G > 0$, which is the case for our data.)
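
The bookkeeping behind the fractional difference can be sketched in a few lines of Python. This is a minimal illustration rather than the analysis code used in the study: the function name `fractional_differences` and the four-question percentages are hypothetical, and the sketch assumes equal weighting, so that the overall difference equals the mean of the per-question differences (which holds when the two groups have equal pre-test averages).

```python
def fractional_differences(pct_mi, pct_trad):
    """Return (delta_G, [delta_g_i]) from per-question percent-correct lists.

    Assumption: every question carries equal weight, so the overall
    difference delta_G is the mean of the per-question differences.
    """
    if len(pct_mi) != len(pct_trad) or not pct_mi:
        raise ValueError("need two equal-length, non-empty lists")
    n_items = len(pct_mi)
    # Per-question difference: M&I percent correct minus traditional.
    delta_G_i = [m - t for m, t in zip(pct_mi, pct_trad)]
    delta_G = sum(delta_G_i) / n_items
    if delta_G == 0:
        raise ValueError("fractional differences are undefined when delta_G is 0")
    # Fractional contribution of each question to the overall difference.
    delta_g = [d / (n_items * delta_G) for d in delta_G_i]
    return delta_G, delta_g


# Hypothetical percentages for a 4-question test (not BEMA data).
pct_mi = [70.0, 55.0, 40.0, 65.0]
pct_trad = [60.0, 57.0, 20.0, 61.0]
delta_G, delta_g = fractional_differences(pct_mi, pct_trad)
average_contribution = 1.0 / len(pct_mi)
above_average = [i + 1 for i, g in enumerate(delta_g) if abs(g) > average_contribution]
```

In this toy example the second question favors the traditional group (negative fractional difference), while the first and third questions contribute more than the average share of 1/4.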

The plot of the fractional difference $\Delta g_i$ for all questions on the BEMA pro-
vides a kind of "fingerprint" for comparing in detail the
performance of M&I and traditional students (Figure 11).
We see that the M&I course has positive $\Delta g_i$ for al-
most all questions on the BEMA, and more than half
of the questions (16) have values of $\Delta g_i$ greater than
0.033. [3i|. The grouping of the BEMA questions by 
category permits one to visualize which topics contribute 
most strongly to the difference in performance. For ex- 
ample, the difference in performance in magnetostatics is 
striking, where nearly every question in this category has
a fractional difference $\Delta g_i > 0.033$; in fact M&I student performance on mag-
netostatics alone accounts for more than half (55%) of
the difference in performance $\Delta G$ relative to traditional
students. The positive $\Delta g_i$ for DC circuits is worthy of
note, even though these questions account for only 12%
of $\Delta G$. Qualitatively speaking, the M&I course seeks to
connect the behavior of circuits to the behavior of both 
transient and steady-state fields; this focus is decidedly 
non-traditional. By contrast, the DC circuit questions 
on the BEMA are quite traditional, so it is tempting to 
think that the traditional course might provide better 
training for responding to such questions. However, Fig-
ure 11 demonstrates that in fact M&I students outper-
form traditional students on traditional DC-circuit ques-
tions. Performance in electrostatics also generally favors
the M&I course (28% contribution to $\Delta G$); however, we
see the performance on question #2 significantly favors 
the traditional course. The topic of question #2 is the 
computation of electric forces using Coulomb's law. It is 
possible that the difference is due to greater time spent 
in the traditional class on electric forces between point 
charges at the beginning of the course. The M&I cur- 
riculum also discusses forces on charges, but moves into 
a full discussion of electric fields due to point charges 
more quickly than the traditional course, thereby devot- 
ing less time to discussing forces exclusively. By contrast, 
we also see the largest single percentage difference in fa- 
vor of M&I in question #5, which deals with the direc- 
tion of electric field vectors due to a permanent electric 
dipole. The electric dipole plays an important role in 
the M&I curriculum due to the curriculum's emphasis 
on the effects of electric fields on solid matter and polar- 
ization, topics which are often skipped or de-emphasized 
in the traditional course; this particular result is there- 
fore not particularly surprising. As a final note, the large 
values of the fractional difference $\Delta g_i$ between M&I and traditional courses in
both magnetostatics and Faraday's Law are interesting
FIG. 11: Fractional difference in performance for E&M
subtopics - The fractional difference in performance $\Delta g_i$ be-
tween M&I and traditional students at GT is shown for each
question on the BEMA. Positive (negative) $\Delta g_i$ indicates su-
perior performance by M&I (traditional) students. The nu-
merical labels indicate the corresponding question number in
order of appearance on the BEMA. The $\Delta g_i$ are grouped to-
gether into one of four topics: Electrostatics (ES), DC circuits
(DC), Magnetostatics (MS), or Faraday's Law and Induction
(FL).

because these topics are regarded as the most difficult 
for students due to their high level of abstraction and ge- 
ometric complexity. It is therefore striking that the M&I 
curriculum seems to be making the largest impact on the 
hardest topics, at least at Georgia Tech. 

As an independent check on the significance of our item 
analysis, we used the method of contingency tables as 
described in Appendix [X] to compare the M&I and tra- 
ditional students' average scores in each individual cate- 
gory. Here, a student's score in a category is computed as 
the sum of correct items in that category, where the num-
ber of items in the four categories ranges from 2 to 12. The
discrete nature of the data, as well as the non-normality 
and unequal variances of the distributions, make contin- 
gency tables the appropriate choice for this type of anal- 
ysis. On the pre-test, we found no significant association 
between course treatment (M&I versus traditional) and 
overall BEMA score on any category. In contrast, the 
results of the contingency table analysis (see Appendix
A 7) for the post-test scores show significant association
of BEMA score with treatment in each category. We in- 
terpret this as showing better performance across topics 
for students in the M&I course. 



VIII. DISCUSSION 

We have presented evidence that introductory 
calculus-based E&M courses that use the Matter & In-
teractions curriculum can lead to significantly higher stu-
dent post-instruction averages on the Brief E&M Assess-
ment than courses using the traditional curriculum. The
strength of this evidence is bolstered by the number of 
different institutions where this effect is measured and 
by the large number of students involved in the measure- 
ments. We interpret these results as showing that M&I is 
more effective than the traditional curriculum at provid- 
ing students with an understanding of the basic concepts 
and phenomena of electromagnetism. This interpretation
is based on accepting that the BEMA is a fair and accu- 
rate measurement of such an understanding. We believe 
this is a reasonable proposition with which most E&M 
instructors would agree, given that the BEMA's items 
cover a broad range of topics common to most introduc- 
tory E&M courses. However, the BEMA was designed 
to measure just this minimal subset of common topics. 
There may be other topics in which traditional students
would outperform M&I students because those topics are not taught or
are de-emphasized in the M&I course, and vice-versa.

The BEMA is not the only instrument to assess stu- 
dent understanding of E&M concepts. The Conceptual 
Survey of Electricity and Magnetism (CSEM) was also 
designed for such a purpose [l^. With the exception of 
electric circuits, omitted by the CSEM, both instruments 
cover similar topics; in fact, several items are common to 
both tests. However, the CSEM contains questions in- 
volving field lines, a topic which is not covered in the M&I 
curriculum (a justification for this omission is discussed 
in [11]). Recent work has shown the CSEM and BEMA 
to be equivalent measures for changes of pedagogy [2^. 
Nevertheless, it would be interesting to use the CSEM in 
comparative assessments of traditional and M&I courses 
to see if it gives results similar to the BEMA; several of 
us are planning to do this in future semesters. 

A major research question raised by these results is 
how and why the M&I curriculum is leading to higher 
performance on the BEMA. The post-instruction BEMA 
results measure only the total effect of the content and 
pedagogy of the entire course; there is no way to tease 
out from these measurements the effects of any individ- 
ual elements of a course. While it is true that interactive 
instruction methods (clickers) were used in almost ev- 
ery M&I class measured, they were also used in many 
of the traditional classes. Recall that M&I sections at 
Georgia Tech still outperformed traditional sections with 
the two instructors noted for excellent pedagogical tech- 
niques. Overall performance differences are not likely to 
be explained by differences in overall time-on-task; the 
weekly classroom contact time was equivalent for both 
M&I and traditional students at two of the four insti- 
tutions (Georgia Tech and Purdue). Time-on-task for 
specific E&M topics may partially explain performance 
differences like those shown in Figure 11. Comparing the
percentage of total lecture hours devoted to each topic 
at GT, we find the M&I course spends significantly more 
lecture time than the traditional course on Magnetostat- 
ics (24% vs 12%); this is consistent with the superior 
performance of M&I students on this topic. However, we 
find superior performance of M&I students on Electro-
statics, for which both courses spend nearly equal lec-
ture time (36% vs 38%). In addition, we also find su-
perior M&I performance on topics where the M&I course
spends significantly less lecture time than the traditional 
course, namely, DC circuits (15% vs 25%) and Faraday's 
Law/Induction (6% vs 11%). We conclude that topical 
time-on-task alone is insufficient to account for perfor- 
mance differences on the BEMA. 

It is possible that the revised learning progression of- 
fered by the M&I E&M curriculum is responsible for the 
higher performance on the BEMA by M&I students. For 
example, more time is spent exclusively on charges and 
fields early in the course, laying conceptual groundwork 
for the mathematically more challenging topics of flux 
and Gauss's law which are dealt with later than is tradi- 
tional. Also, magnetic fields are introduced earlier than 
is traditional, giving students more time to master this 
difficult topic. Finally, M&I emphasizes the effects of 
fields on matter at the microscopic level. In some of the 
traditional courses discussed in this paper, dipoles and 
polarization are not discussed.
APPENDIX A: HYPOTHESIS TESTING AND
CONFIDENCE INTERVALS 

In this paper we have emphasized the use of error 
bounds to indicate the size of comparison effects, but 
we also sometimes mentioned statistical evidence for be- 
ing able to state that some comparison was or was not 
statistically significant. In this appendix, we present the 
details of how "significance" is determined. 

1. Is there a difference? 

In educational research we often wish to compare two 
or more methods of instruction to determine if (and how) 
they differ from each other. In our case, we attempt to 
address the question of whether instruction in Matter 
and Interactions (M&I) results in better performance on 
a standard test of Electromagnetism (E&M) understand-
ing (i.e., the Brief Electricity and Magnetism Assessment
or the BEMA) than instruction in the traditional course. 
We have gathered, under various arrangements, scores on 
this test for the M&I and traditional classes. Do they, 
in fact, differ? And just what do we mean by differ? 
In what ways? We need a set of procedures to allow us
to answer these kinds of questions under conditions of 
incomplete information. Our information is incomplete 
for a variety of reasons. While we are basically inter- 
ested in possible differential outcomes (e.g., in BEMA 
test scores) of our two instructional treatments (M&I and 
traditional), there are many factors that might affect the
outcome, that is, obscure any real differences due to the 
treatments alone. The classes may differ in the abilities 
of the students, in the particular qualities of instructors, 
or in methods of course performance evaluation. We may 
address some of these concerns by attempts to equate var-
ious conditions through proper sampling as well as more
directly assessing potential differences. 

For simplicity, let us assume that we are drawing a ran- 
dom sample of size 2n from the same population, that 
is, from all possible physics students who could prop- 
erly participate in this study, a very large number, N. 
We then randomly assign our two treatment conditions 
to the sample, yielding two samples of size n. We then 
differentially expose these two samples to our two cur- 
ricula, M&I and traditional, and obtain a distribution of 
scores for each. Ideally, we would not be restricted to sam-
ples of size 2n (where 2n << N) randomly drawn from a
parent population, but could subject that entire popula-
tion to our differential treatments by randomly assigning
the two treatments among all members of the popula-
tion, essentially dividing the original population into
two equal sub-populations of size N/2. If our treatments
had no effect, then the two sub-population distributions 
of scores would be identical and indistinguishable from 
the parent population. If, however, we could show that 
the two distributions differed, then we could say that our 
treatments produced two different populations. By dif- 
ferent, we mean that one or more parameters (e.g., mean, 
median, variance, etc.) of the populations differed from 
each other. But how big a difference is a difference? That 
question can be addressed by classical hypothesis testing 
procedures (i.e., statistical inference). 

Of course, we have to be satisfied by sampling from 
a population to obtain estimates of population param- 
eters, one illustration of incomplete information. Mea- 
sures, that is, functions defined on samples are called 
statistics. The arithmetic mean of a sample of size n, for 
example, is the sum of the sample values divided by n. 
The sample mean will then be an estimate of the popula- 
tion mean; the sample variance an estimate of the popula- 
tion variance, etc. Obviously, the larger the sample size, 
the better the estimate of a population parameter. If 
we draw multiple independent random samples and com-
pute a statistic, we will obtain a distribution of the sample
statistic. A sample statistic is a random variable and its 
distribution is called a sampling distribution. Sampling 
distributions are essential to the procedures of statistical 
inference; they describe sample-to-sample variability in 
measures on samples. For example, suppose we are interested
in determining whether two populations differ in their
means, and let us assume they are otherwise identical. We
may draw a random sample of size n from each and com-
pute the mean of that sample. Each value is an estimate
of its respective population mean, but each is also but
one value drawn from a distribution of sample means. If
the two populations were the same, then the two sample
means would be just two estimates of the same popula- 
tion mean because they would have come from the same 
sampling distribution. The closer the two values are, the 
more likely this is the case; the greater their difference, 
the more likely it is they come from different sampling 
distributions and thus from different populations. "More 
(or less) likely" is a phrase calling for quantification and 
probability theory provides that through measures called 
test statistics. These have specific sampling distributions 
that allow probabilities of particular cases to be deter-
mined by consulting standard tables or through statisti-
cal packages. Common examples include the z statistic,
Student's t, chi-square, and the F-distribution. All of
these distributions are related to the normal distribution.
The z-statistic is standard normal; the t, chi-square, and
F are asymptotically normal. With some exceptions,
their applications assume normality of the sampled par- 
ent distribution, though they differ in robustness with 
respect to that assumption. In a typical implementation, 
a test statistic or, more commonly, a "statistical test" is 
chosen based on the stated hypotheses and by consider- 
ing assumptions made about population characteristics, 
sampling procedures, and study design. 

Statistical inference involves testing hypotheses about 
populations by computing appropriate test statistics on 
samples to obtain values from which probability es- 
timates of obtaining those values can be determined. 
These lead either to accepting or rejecting an hypoth- 
esis about some aspect of a population. Many hypothe- 
ses involve inferences about measures of central tendency 
(e.g., the mean) or dispersion (e.g., the variance). For-
mal hypothesis testing is stated in terms of a null hy-
pothesis, $H_0$, and a mutually exclusive alternative, $H_1$.
The null hypothesis is assumed true and is rejected only
by obtaining, through appropriate statistical testing, a
probability value ("p-value") less than some pre-assigned
value. This pre-assigned value, called the level of sig-
nificance or $\alpha$, is usually 0.05. Of course, one could be
wrong in rejecting (a "Type I Error") or accepting (a
"Type II Error") the null hypothesis regardless of the p-
value obtained. A result is either statistically significant
or not; there is no "more", "highly", or "less" significant 
outcome. 

For example, we can test the null hypothesis that the 
population mean scores for the M&I and traditional cur- 
riculum treatments are equal (i.e., the scores all come 
from a common population, assuming all other popula-
tion parameters are equal):

$$H_0: \mu_{MI} = \mu_T \quad (\text{or, } \mu_{MI} - \mu_T = 0) \tag{A1a}$$
$$H_1: \mu_{MI} \neq \mu_T \quad (\text{or, } \mu_{MI} - \mu_T \neq 0) \tag{A1b}$$

In this case we are considering two populations from 
which we sample independently and for each pair of sam-
ples we calculate the sample means and then take their
difference. We then have a sampling distribution of sam-
ple mean differences. If our $H_0$ is true, we would expect
the mean of that distribution to be zero. 

If the probability p corresponding to the computed 
test statistic is less than a selected threshold, typically 
p < 0.05, then the hypothesis of equal means is rejected. 
We deem the difference statistically significant and infer 
that the two populations are statistically different. By 
contrast, if p corresponding to the computed statistic is 
greater than a pre-assigned value, then the hypothesis 
of equal means cannot be ruled out. The null and its 
associated alternative hypothesis can be more specific, 
for example, the above alternative hypothesis could be 
$H_1: \mu_{MI} > \mu_T$ (or, $\mu_{MI} - \mu_T > 0$).
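
The decision rule just described can be illustrated with a small permutation test. Note that this is a swapped-in, distribution-free illustration of how a p-value is compared against a threshold; it is not the test used in the study. The function name `permutation_p_value`, the seed, the number of trials, and the two score lists are all hypothetical.

```python
import random
import statistics


def permutation_p_value(x, y, trials=5000, seed=0, one_sided=False):
    """Estimate a p-value for H0: x and y come from one common population.

    Under H0 the group labels are arbitrary, so we reshuffle them many times
    and count how often the reshuffled mean difference is at least as extreme
    as the observed one (mean of y minus mean of x).
    """
    rng = random.Random(seed)
    observed = statistics.fmean(y) - statistics.fmean(x)
    pooled = list(x) + list(y)
    m = len(x)
    hits = 0
    for _ in range(trials):
        rng.shuffle(pooled)
        diff = statistics.fmean(pooled[m:]) - statistics.fmean(pooled[:m])
        if one_sided:
            hits += diff >= observed            # alternative: mean(y) > mean(x)
        else:
            hits += abs(diff) >= abs(observed)  # alternative: means differ
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (trials + 1)


# Hypothetical scores, for illustration only.
trad = [9, 11, 12, 10, 8, 13, 11, 10, 9, 12]
mi = [14, 16, 13, 17, 15, 18, 14, 16, 15, 17]
p_two_sided = permutation_p_value(trad, mi)
p_one_sided = permutation_p_value(trad, mi, one_sided=True)
p_identical = permutation_p_value(trad, trad)
reject_h0 = p_two_sided < 0.05
```

Comparing a sample with itself gives an observed difference of zero, so every reshuffle is at least as extreme and the estimated p-value is 1; the hypothesis of equal means cannot be ruled out in that case.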



2. Is it normal? 

Which test statistic is most appropriate? As already 
indicated, this depends on a number of factors including 
sample size and the characteristics of the parent popu- 
lations from which the samples are drawn. Recall we 
perform our tests based on the sampling distributions of 
the particular statistics of interest. If we are interested 
in testing hypotheses about differences in means, then 
we will be concerned about the sampling distribution of 
those differences. How do the parent distributions from 
which our samples are drawn affect the sampling distri- 
bution? If the parent distributions are normal, then the 
sampling distribution of differences in means will also 
be normal. The difference of means is a simple linear 
transformation of the parent distributions. What if the 
parent distributions are not normal? There are tests for 
this, given our sample distributions, as we indicate sub-
sequently. But the Central Limit Theorem states that if
random samples of size n are drawn from a parent distri-
bution with mean $\mu$ and finite standard deviation $\sigma$, then
as n increases, the sampling distribution of the sample mean approaches a
normal distribution with mean $\mu$ and standard deviation
$\sigma/\sqrt{n}$. Hence normality of the parent distribution is not
required. This applies equally to the case of differences 
in sample means. Thus, given large enough sample sizes, 
we might be tempted to directly use the z-statistic to test 
our hypotheses about means. However, this test assumes 
we know the population variances and we virtually never 
do. We might then resort to the t-distribution in which
we estimate population variances from our samples, but
the t-distribution assumes normality of the parent distri-
butions. While the t-distribution tests are relatively ro-
bust with respect to this assumption, not all parametric
tests we wished to perform on our data are. Moreover,
the t-distribution tests are not appropriate for samples
drawn from skewed distributions [21], [22]. As we show
below, through an appropriate test we found the BEMA 
scores in our studies were likely drawn from non-normal 
and skewed population distributions. 
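
The Central Limit Theorem statement in this section can be checked numerically. The sketch below is an illustration under stated assumptions, not part of the study: it draws repeated samples from a strongly skewed parent (an exponential distribution with mean 1 and standard deviation 1, chosen here purely for convenience) and compares the spread of the sample means with the predicted value of the parent standard deviation divided by the square root of n. The helper name `sample_means`, the seed, and the sample sizes are hypothetical.

```python
import math
import random
import statistics

rng = random.Random(1)


def sample_means(draw, n, trials):
    """Return `trials` sample means, each computed from n independent draws."""
    return [statistics.fmean([draw() for _ in range(n)]) for _ in range(trials)]


def draw_skewed():
    # Exponential parent: mean 1, standard deviation 1, clearly non-normal.
    return rng.expovariate(1.0)


n = 100
means = sample_means(draw_skewed, n=n, trials=4000)
observed_center = statistics.fmean(means)
observed_sd = statistics.stdev(means)
predicted_sd = 1.0 / math.sqrt(n)  # sigma / sqrt(n) with sigma = 1
```

Even though the parent distribution is skewed, the simulated sample means cluster around the parent mean with a spread close to the predicted value.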



Fortunately, there are powerful distribution-free meth- 
ods, often called "non-parametric" statistics, that place 
far fewer constraints on parent distributions. So, to be 
both consistent and conservative we subjected all our 
data to statistical tests using these methods. However, 
because our sample sizes were typically large, we were of- 
ten able to take advantage of the Central Limit Theorem 
and thus ultimately make use of the normal distribution. 



3. Is it not normal? 

In Figure 2 we display the distributions of scores on the
BEMA for the M&I and traditional groups at the various 
institutions where our studies were conducted. Are we 
justified in assuming normality of the parent populations 
given these sample distributions? Our null hypothesis 
would be that each of these sample distributions reflects 
a population normal distribution with unknown mean 
and variance against the alternative hypothesis that the 
population distributions are non-normal. The general 
method is a goodness-of-fit test originally developed by 
Kolmogorov and extended by Lilliefors to the case of an unspec-
ified normal distribution [23, H^]. The basic approach
is to assess the difference between a normal distribution 
"constructed" from the data and the actual data. The 
data consist of a random sample $X_1, X_2, \ldots, X_n$ of size
n drawn from some unknown distribution function, $F(x)$.
Recall, a distribution function is a cumulation (i.e., an in-
tegral) of a probability distribution or density function.
The normal distribution function is the familiar ogive,
the integral of the normal density function, giving the
probability: $P(x \le a) = F(a)$, $(0 \le F(x) \le 1)$. Under
the null hypothesis, F(x) is a normal distribution func- 
tion and we can estimate its mean and standard deviation 
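from our data. The maximum likelihood estimate for the
mean, $\mu$, is

$$\bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i. \tag{A2}$$

The standard deviation, $\sigma$, is estimated by:

$$s = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})^2}. \tag{A3}$$

These values allow us to specify our hypothesized nor-
mal distribution. We can now "construct" our empirical
distribution $S(x)$ by computing z-scores from each of our
sample values, defined by

$$Z_i = \frac{X_i - \bar{X}}{s}, \quad i = 1, \ldots, n. \tag{A4}$$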



Now, we draw a graph of $S(x)$ using these $Z_i$ values and
superimpose the normal distribution function $F(x)$ from
our estimated parameters. The Lilliefors test statistic is
remarkably simple:

$$T_1 = \max_x |F(x) - S(x)|; \tag{A5}$$

that is, the maximum vertical distance between the two 
graphs. A table with this test statistic's p-values can be 
found in standard texts on non-parametric statistics [23]
or from appropriate statistical packages. For large sam-
ples ($n > 31$), the $p < 0.01$ value, for example, is de-
termined as $1.035/d_n$ where $d_n = (\sqrt{n} - 0.01 + 0.83/\sqrt{n})$.
Obtaining a value that large or larger leads to rejection 
of the null hypothesis of normality at the 0.01 level of 
significance. 
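
As a rough sketch of the procedure just described, the statistic and the large-sample 0.01 threshold can be computed with the Python standard library alone. The code below is illustrative and not the software used in the study: the function names are hypothetical, the sample standard deviation is assumed to use n - 1 in the denominator, and the two test samples (cubes of the integers 1 to 40, and 40 evenly spaced normal quantiles) are made up to show one rejection and one non-rejection.

```python
import math
from statistics import NormalDist, fmean, stdev

STD_NORMAL = NormalDist()  # standard normal, mean 0 and standard deviation 1


def lilliefors_statistic(sample):
    """Maximum vertical distance between the standard normal distribution
    function and the empirical distribution function of the z-scores."""
    n = len(sample)
    mean = fmean(sample)
    s = stdev(sample)  # assumption: n - 1 in the denominator
    z = sorted((x - mean) / s for x in sample)
    t1 = 0.0
    for i, zi in enumerate(z):
        f = STD_NORMAL.cdf(zi)
        # The empirical step function jumps from i/n to (i+1)/n at zi,
        # so the gap is checked on both sides of the jump.
        t1 = max(t1, abs(f - i / n), abs((i + 1) / n - f))
    return t1


def critical_value_001(n):
    """Large-sample threshold at the 0.01 level: 1.035 / d_n."""
    d_n = math.sqrt(n) - 0.01 + 0.83 / math.sqrt(n)
    return 1.035 / d_n


# Hypothetical samples of size 40: one strongly skewed, one bell-shaped.
skewed = [float(k ** 3) for k in range(1, 41)]
bell = [STD_NORMAL.inv_cdf((k - 0.5) / 40) for k in range(1, 41)]

threshold = critical_value_001(40)
t_skewed = lilliefors_statistic(skewed)
t_bell = lilliefors_statistic(bell)
reject_skewed = t_skewed >= threshold
reject_bell = t_bell >= threshold
```

The skewed sample exceeds the threshold and normality is rejected at the 0.01 level, while the bell-shaped sample does not.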

Using this test, all the distributions shown in Figure 
[5] were determined to be significantly different from nor- 
mal. At the 0.05 level, the following test distributions 
were found to be non-normal: GT M&I pre-test, GT 
traditional pre-test, GT M&I post-test, GT traditional 
post-test, Purdue M&I pre-test, Purdue traditional pre- 
test, Purdue M&I post-test, Purdue traditional post-test, 
NCSU traditional post-test, and CMU traditional post- 
test. Two test distributions, NCSU M&I post-test and 
CMU M&I post-test, were found to be normal. Demo- 
graphic data was subjected to this test as well. The fol- 
lowing demographic data were found to be non-normal at 
the 0.05 level: GT M&I GPA, GT traditional GPA, GT
M&I E&M grade, GT traditional E&M grade, Purdue
M&I GPA, Purdue traditional GPA, NCSU M&I GPA,
NCSU traditional GPA, NCSU M&I SAT score, NCSU 
traditional SAT score, NCSU M&I Math and Physics 
GPA, NCSU traditional Math and Physics GPA, CMU 
M&I SAT score, and CMU traditional SAT score. Given 
these results, we elected to adopt statistical tests that 
did not assume normality. 



variance tests we used arc based on ranks and are akin 
to distribution-free methods for testing differences be- 
tween group means (or medians). Because the latter two- 
sample tests are easier to describe, we begin with hypoth- 
esis testing about differences between groups in measures 
of central tendency. A brief description of the variance 
tests will follow. Aside from the assumptions about equal 
variances, the tests to be described only assume random 
samples with independence within and between samples. 

We have two samples Xi {i ~ 1 . . . m) and Yt {i = 
1 . . .n) for a total of m + 7i = observations. For ex- 
ample, the Xi's could be scores on the BEMA from tra- 
ditional classes and the l^'s BEMA scores from the M&I 
classes. Putative differences in measures of central ten- 
dency, whether referring to means or medians, are some- 
times called location shifts. Assuming the distributions, 
whatever their shape, arc otherwise equal, then changes 
in the mean or median of one (e.g., produced by an ex- 
perimental treatment) merely shifts it to the right or left 
by some amount A. If we are interested in differences in 
means, then 

A = E{Y) - E{X), (A6) 

the difference in expected values of the distributions is a 
measure of treatment effect. 

Let F(t) be the distribution function corresponding to 
population of traditional students and G{t) the distribu- 
tion function corresponding to population of M&I stu- 
dents. Om' null hypothesis tested is 

Hq : F{t) = Git), for every t. (A7) 

That is. 



4. The two-sample tests and assumptions about 
variance 

As already discussed in the initial section of the Ap- 
pendix, a number of our questions involved comparisons 
between M&I and traditional treatments under various 
conditions. In standard parametric statistics, hypothesis 
testing of such comparisons makes assumptions about 
variances in the populations under test. For example, 
$t$-tests of differences in means used with two independent 
samples assume, in addition to normality, that the 
population variances are equal. Likewise, analysis of variance 
(ANOVA) tests of differences between means with $k$ 
independent samples ($k > 2$) also assume equal variances 
in the populations under test. This assumption is called 
homogeneity of variance. 

Curiously, the assumption of equal variances also extends 
to typical distribution-free methods for testing hypotheses 
about differences between or among treatments [23]. 
Thus, before applying such tests, we tested the hypothesis 
of equal variance. In all cases tested we were unable to 
reject the null hypothesis of equal variances. The 



$$H_0: \Delta = 0. \tag{A8}$$

The alternatives are 

$$H_1: F(t) \neq G(t) \tag{A9}$$

or 

$$H_1': G(t) = F(t - \Delta), \quad \text{for every } t \quad (\text{i.e., } \Delta > 0). \tag{A10}$$

These two alternatives reflect whether we are simply 
interested in showing any difference (for example, 
whether entering BEMA scores for the M&I and 
traditional groups differ) or, as in $H_1'$, whether 
post-instruction BEMA scores for the M&I group exceed those 
of the traditional group. 

The Wilcoxon (or Mann-Whitney) test statistic, $W$, is 
based upon rankings of the sample values. The procedure 
is simple: rank the combined $N = m + n$ sample scores 
from least to greatest, then pick the ranks of the observations 
from one of the samples in the combined set, say the M&I ranks, 
$R(Y_j)$, and sum them: 

$$W = \sum_{j=1}^{n} R(Y_j). \tag{A11}$$

There are methods for handling tied ranks that we will 
not discuss in detail here. Having chosen a level of 
significance, $\alpha$, and the particular alternative hypothesis, 
one can find tables of this statistic in any text on 
nonparametric statistics or obtain critical values from standard 
statistical software packages. 

If, as in our case, the sample sizes are large (n > 20), 
then W approaches the normal distribution and one can 
use the standard normal tables. The large-sample ap- 
proximation to W is found from the expected value and 
variance of the test statistic W, 



5. Homogeneity of variance tests 

Perhaps the simplest test of homogeneity of variance 
is the squared-ranks test [2^. It is quite similar to the 
Wilcoxon test described in Section A4. Because it 
concerns variances, squaring certain values plays a role. 
Recall the definition of the variance of a distribution as the 
expected value of $(X - \mu)^2$. If the mean of the 
distribution is unknown, as discussed before, we estimate it from 
our sample. In the case of testing equality of variances 
with two independent samples, we have one random 
sample of $m$ values, $X_1, X_2, \ldots, X_m$, and another of size $n$, 
$Y_1, Y_2, \ldots, Y_n$. 

We now determine the absolute deviation scores of each 
value from their respective sample means, 

$$U_i = |X_i - \bar{X}|, \quad i = 1, \ldots, m \tag{A15}$$

and 



$$E(W) = \frac{n(m + n + 1)}{2}, \tag{A12}$$

$$\mathrm{var}(W) = \frac{mn(m + n + 1)}{12}. \tag{A13}$$

The large-sample version of the Wilcoxon statistic, $W_z$, 
is then 

$$W_z = \frac{W - E(W)}{\sqrt{\mathrm{var}(W)}}. \tag{A14}$$



Under the null hypothesis, $W_z$ approaches the 
standard normal distribution $N(0, 1)$, so we reject $H_0$ if 
$W_z > z_\alpha$, where $\alpha$ is our level of significance. 
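
As a rough illustration of the large-sample procedure described above, the following Python sketch ranks the combined sample, sums the ranks of the second sample to obtain $W$, and standardizes it with the expected value and variance of $W$. The function names and the toy data are hypothetical, and ties are handled by average ranks without any tie correction to the variance, which is an assumption made here for brevity rather than the procedure used in the study.

```python
from math import sqrt
from statistics import NormalDist


def midranks(values):
    """Return 1-based ranks; tied values share their average rank (assumption)."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the block while the next sorted value equals the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        avg = (pos + end) / 2 + 1  # average of the 1-based positions in the block
        for k in order[pos:end + 1]:
            ranks[k] = avg
        pos = end + 1
    return ranks


def wilcoxon_large_sample(x, y):
    """Return (W, W_z) where W is the sum of the ranks of the y sample."""
    m, n = len(x), len(y)
    ranks = midranks(list(x) + list(y))
    w = sum(ranks[m:])                  # ranks belonging to the y sample
    e_w = n * (m + n + 1) / 2           # expected value of W under H0
    var_w = m * n * (m + n + 1) / 12    # variance of W under H0 (no tie correction)
    return w, (w - e_w) / sqrt(var_w)


# Toy data only; the normal approximation is meant for large samples.
x_scores = [1, 2, 3, 4]
y_scores = [5, 6, 7, 8]
w, w_z = wilcoxon_large_sample(x_scores, y_scores)
p_one_sided = 1 - NormalDist().cdf(w_z)  # upper-tail p-value for Delta > 0
print(w, w_z, p_one_sided)
```

With real data one would reject the null hypothesis at level $\alpha$ when the returned standardized value exceeds the upper $\alpha$ quantile of the standard normal distribution.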

The comparison data from each institution shown in 
Figure 2 were each tested for significant differences 
between post-instruction BEMA scores in the traditional 
and M&I treatments ($H_1': \Delta > 0$). In all cases, the M&I 
groups were shown to outperform the traditional groups 
at the 0.05 significance level. 

In addition, the pre-instruction BEMA scores from 
Georgia Tech and Purdue shown in Figure S] were tested 
for differences. In this case, we found we could not reject 
the null hypothesis for Georgia Tech, but we detected 
a significant difference in the Purdue populations. For 
a discussion of the Purdue pre-instruction differences, refer to 
Section IV. Finally, we tested demographic data, as listed 
at the end of Section A3 of this Appendix. Matched sets 
were compared, e.g., GT M&I GPAs and GT traditional 
GPAs. We were unable to reject the null hypothesis, 
$H_0: \Delta = 0$, in each matched set. Hence, we 
conclude that the student populations at each institution 
are similar insofar as GPAs, SAT scores, and grades 
in Physics and Calculus courses are concerned. 



$$V_j = |Y_j - \bar{Y}|, \quad j = 1, \ldots, n. \tag{A16}$$



As in the Wilcoxon test, we obtain the ranks of the 
combined deviation scores, a total of $N = m + n$. If 
there are no ties, then the test statistic is simply the sum of 
the squares of the ranks from one of the samples, say, the 
$V_j$'s (if there are ties, the expression is more complicated): 

$$D = \sum_{j=1}^{n} \left[ R(V_j) \right]^2. \tag{A17}$$

The null hypothesis, $H_0$, is $\mathrm{Var}(X) = \mathrm{Var}(Y)$ and the 
alternative, $H_1$, is $\mathrm{Var}(X) \neq \mathrm{Var}(Y)$. This test is not 
affected by differences in means, because variances of 
distributions are not affected by location shifts. 

If, as in our case, the sample sizes are large, the standardized 
form of this test statistic, written here as $D_z$, approaches the 
standard normal distribution: 

$$D_z = \frac{D - n(N + 1)(2N + 1)/6}{\sqrt{mn(N + 1)(2N + 1)(8N + 11)/180}}. \tag{A18}$$

Because we are interested in any difference regardless 
of direction, we reject $H_0$ if 

$$|D_z| > z_{\alpha/2}, \tag{A19}$$

where $\alpha$ is our chosen significance level. Modifications 
of this test can be used to test differences in variances 
among $k > 2$ samples [2^. As already indicated, 
in all cases applied to our data, we could not reject the 
null hypothesis of equal variances at the 0.05 level. Data 
are compared using matched sets, e.g., GT M&I pre-test 
scores and GT M&I post-test scores, or GT M&I GPAs 
and GT traditional GPAs. The inability to reject 
the null hypothesis at the 0.05 level applies to all matched 
sets listed in Section A3 of this Appendix. 
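
The squared-ranks procedure of this section can be sketched in a few lines of Python. The sketch below forms the absolute deviations from each sample mean, ranks the combined deviations, sums the squared ranks of the second sample, and standardizes the sum with the large-sample expression given above. The function name and the toy data are hypothetical, and the code assumes there are no tied deviations, since the tie-corrected expression is not reproduced in this Appendix.

```python
from math import sqrt
from statistics import NormalDist, mean


def squared_ranks_z(x, y):
    """Return (D, standardized D) for the two-sample squared-ranks test.

    Assumes no ties among the absolute deviation scores.
    """
    m, n = len(x), len(y)
    big_n = m + n
    x_bar, y_bar = mean(x), mean(y)
    u_dev = [abs(value - x_bar) for value in x]   # deviations in the first sample
    v_dev = [abs(value - y_bar) for value in y]   # deviations in the second sample
    combined = u_dev + v_dev
    order = sorted(range(big_n), key=lambda k: combined[k])
    ranks = [0] * big_n
    for rank, k in enumerate(order, start=1):
        ranks[k] = rank
    d = sum(r * r for r in ranks[m:])             # squared ranks of the second sample
    e_d = n * (big_n + 1) * (2 * big_n + 1) / 6
    var_d = m * n * (big_n + 1) * (2 * big_n + 1) * (8 * big_n + 11) / 180
    return d, (d - e_d) / sqrt(var_d)


# Toy data only; the normal approximation is meant for large samples.
d, d_z = squared_ranks_z([2, 4, 9], [0, 8, 22])
p_two_sided = 2 * (1 - NormalDist().cdf(abs(d_z)))
print(d, d_z, p_two_sided)
```

Equal variances would be rejected at level $\alpha$ when the absolute value of the standardized statistic exceeds the upper $\alpha/2$ quantile of the standard normal distribution.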
6. Gaining Confidence Intervals 

The specification of confidence intervals for selected 
parameters of population distributions is a common 
alternative to formal hypothesis testing. Indeed, many 
researchers much prefer this method when it can be applied, 
for reasons we will not explore here [21]. Simply 
put, confidence intervals can provide a "quick picture" 
of bounds on a population parameter based on sampling 
distribution estimates. They allow one to see if putative 
differences between group treatments are worth considering. 
This derives from our obtaining bounds on estimates 
of population parameters, such as means, medians, or 
variances. For simplicity, let us assume we are sampling 
from a normal distribution with known variance, $\sigma^2$, and 
attempting to determine bounds on the population mean, $\mu$, 
by selecting samples of size $n$ and computing the sample 
mean, $\bar{X}$. The sampling distribution of 

$$z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \tag{A20}$$



is the unit normal distribution $N(0, 1)$. If we were to 
randomly draw one $z$ statistic from this distribution, then 
the probability that the obtained $z$ will come from the 
open interval $(-z_{0.025}, z_{0.025})$ is 

$$P(-z_{0.025} < z < z_{0.025}) = 1 - 0.05 = 0.95. \tag{A21}$$



of freedom given by $\nu = n - 1$. The degrees of freedom is 
smaller than $n$ because we are using the sample to 
estimate the standard deviation. The appropriate values of 
$t$ are found in standard tables [2^ . As $\nu$ grows large, the 
$t$-distribution approaches the normal distribution, so for values 
of $\nu$ greater than about 120, the $z$-table may be used. 

We need to emphasize that it is not true that, for any 
given sample, the confidence coefficient is the probability that 
the mean, $\mu$, lies within the interval computed from that sample. 
Once $\bar{X}$ is specified, it is no longer a 
random variable; either $\mu$ lies in that interval, or it does 
not. Keep in mind that the analysis derives from 
considering all possible random samples of size $n$ drawn from 
the population to yield a distribution of confidence 
intervals. Ninety-five percent of those intervals will include $\mu$ 
within the limits of $\pm t_{0.025}(\sigma/\sqrt{n})$, but 5% will not. 

Parametric confidence-interval determinations (e.g., 
using the $z$ statistic or the $t$-distribution) based on 
assumptions of normality (or at least symmetry with large 
sample sizes) may not be appropriate when confronted 
with non-normal, asymmetric distributions of the sort we 
encountered in our study. However, as stated earlier, the 
$t$-distribution is relatively robust with respect to normality 
provided the distribution is not significantly skewed. 
We obtained a measure of skewness for our sample 
distributions and determined that our distributions did not 
significantly depart from symmetry [2^ . We thus used 
the $t$-statistic for all the determinations of 95% confidence 
intervals shown in Figures [B El [5l [HI [71 [51 [HI and m. 



These values define confidence limits of the 95% confi- 
dence interval for the population mean based on random 
samples drawn from that population. The 0.95 probabil- 
ity specification is called the confidence coefficient. The 
probability expressed in terms of the sample statistic is 
then 



$$P\left(\bar{X} - z_{0.025}\frac{\sigma}{\sqrt{n}} < \mu < \bar{X} + z_{0.025}\frac{\sigma}{\sqrt{n}}\right) = 0.95. \tag{A22}$$



More generally, we can find a two-sided $100(1 - \alpha)\%$ 
confidence interval for the mean, $\mu$: 

$$\bar{X} - z_{\alpha/2}\frac{\sigma}{\sqrt{n}} < \mu < \bar{X} + z_{\alpha/2}\frac{\sigma}{\sqrt{n}}. \tag{A23}$$
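
To make the known-variance interval concrete, here is a minimal Python sketch that computes a two-sided $100(1 - \alpha)\%$ interval for the mean from a sample and an assumed known standard deviation. The function name, the sample values, and the value of the standard deviation are all hypothetical and chosen only for illustration.

```python
from math import sqrt
from statistics import NormalDist, mean


def z_confidence_interval(sample, sigma, alpha=0.05):
    """Two-sided 100(1 - alpha)% interval for the mean, known sigma assumed."""
    n = len(sample)
    x_bar = mean(sample)
    z = NormalDist().inv_cdf(1 - alpha / 2)   # upper alpha/2 quantile of N(0, 1)
    half_width = z * sigma / sqrt(n)
    return x_bar - half_width, x_bar + half_width


# Hypothetical scores with an assumed known standard deviation of 2.
lower, upper = z_confidence_interval([10, 12, 14, 16], sigma=2.0)
print(lower, upper)
```

The same structure applies when the variance is estimated from the sample, except that the normal quantile is replaced by a quantile of the $t$-distribution, which the Python standard library does not provide.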
Generally speaking, the population variance is 
unknown. It may be estimated from the sample, in which 
case, given certain assumptions below, the $t$-statistic is 
more appropriate and the two-sided $100(1 - \alpha)\%$ 
confidence interval is then 

$$\bar{X} - t_{\alpha/2,\nu}\frac{s}{\sqrt{n}} < \mu < \bar{X} + t_{\alpha/2,\nu}\frac{s}{\sqrt{n}}, \tag{A24}$$

where $s$ is the standard deviation estimate from the 
sample (see Section A3 in this appendix) and $\nu$ is the degrees 



7. Using Contingency Tables 

An analysis of a group of items within a given set 
reveals the contribution made by those items to the 
overall set. This is the approach taken in the first part of 
Section VIII. Alternatively, one can ask whether there is 
an association between two variables in this set: do we 
find an association between one variable (treatment) and 
another variable (performance) on a given topic? 
Contingency table analysis can describe whether an 
association between treatment and performance exists and the 
confidence level of that association. The p-values obtained 
from contingency table analysis are conservative compared 
to those obtained using parametric tests [2^. 

The approach is to form a table of events. An event 
can be any countable item. In our case, it 
will be the total score on a given topic tested on the BEMA. 
This section provides an example using data from 
the Magnetostatics item analysis for Georgia Tech given 
in Section VIII. By separating the responders into their 
given sections, traditional versus M&I, and counting the 
responders with each overall score on a given topic, one 
constructs a valid contingency table. This table appears as 
the middle two columns, $O_{mi}$ and $O_{trad}$, in Table II. 
A valid contingency table requires that no responder is 
counted twice. One could not use individual items as 
the events, because a responder may have answered several 
different questions correctly. Using the total number of correct 
items ensures that each responder is counted only once. 



TABLE II: Observed counts for number of correct answers 
in the Magnetostatics (MS) topic for the BEMA post-test 
are shown for all Matter & Interactions and all traditional 
sections at Georgia Tech. The total number of correct items is 
denoted by $N_c$ (maximum: 9). The number of students with 
$N_c$ correct answers appears in the column $O_{mi}$ for Matter 
& Interactions and $O_{trad}$ for traditional. The sum of these 
columns appears in $O_t$. 

N_c    O_mi    O_trad    O_t
0         7       45       52
1        21      155      176
2        33      224      257
3        47      227      274
4        59      195      254
5        90      142      232
6       118      113      231
7       102       72      174
8        87       56      143
9        48       17       65



TABLE III: Expected counts for number of correct answers 
in the Magnetostatics (MS) topic for the BEMA post-test are 
shown for all Matter & Interactions and all traditional 
sections at Georgia Tech. The total number of expected correct 
items is denoted by $N_c$ (maximum: 9). The expected number 
of students with $N_c$ correct answers appears in the column 
$E_{mi}$ for Matter & Interactions and $E_{trad}$ for traditional. 
The sum of these columns appears in $E_t$. 

N_c    E_mi     E_trad    E_t
0      17.13     34.87     52
1      57.97    118.03    176
2      84.65    172.35    257
3      90.25    183.75    274
4      83.66    170.34    254
5      76.42    155.58    232
6      76.09    154.91    231
7      57.31    116.69    174
8      47.10     95.90    143
9      21.41     43.59     65




After counting the events, labeled $N_{ij}$ (row $i$, column $j$), the 
column and row sums for the table are computed. Summing down a 
column, 

$$N_{\cdot j} = \sum_{i} N_{ij}, \tag{A25}$$

is equivalent to counting the total number of responders 
in each treatment in Table II, while summing across a 
row, 

$$N_{i \cdot} = \sum_{j} N_{ij}, \tag{A26}$$

is equivalent to counting the total number of responders 
with a given score regardless of treatment. These 
numbers appear in column $O_t$ in Table II. One can determine 
the total number of responders by summing over all rows and 
columns, 

$$N = \sum_{i} \sum_{j} N_{ij}. \tag{A27}$$

This is equivalent to summing up the entries in column 
$O_t$ in Table II. 

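We are able to compute an expected value for the 
number of events, $n_{ij}$, and compare that expectation value to 
the actual count. If treatment has no effect on the scores, 
that is, if we cannot distinguish any association between 
the treatment and score, we expect that the fraction of 
events in a given row is the same regardless of treatment. 
Writing $N_{i \cdot}$ and $N_{\cdot j}$ for the row and column sums and 
$N$ for the total number of responders, we can propose the null 
hypothesis, 

$$H_0: \frac{N_{ij}}{N_{\cdot j}} = \frac{N_{i \cdot}}{N},$$

with the alternative hypothesis, 

$$H_1: \frac{N_{ij}}{N_{\cdot j}} \neq \frac{N_{i \cdot}}{N}.$$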
Table III illustrates these expected values for the 
Magnetostatics topic. The columns $E_{mi}$ and $E_{trad}$ contain 
the expected number of students with a given score, $N_c$. 
We can do a quick comparison of rows between Tables II 
and III. This provides an interesting contrast of higher 
(lower) expectations and actual counts. 

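A more rigorous approach is to perform a chi-square 
analysis with this expectation value, $n_{ij}$. We calculate 
the chi-square statistic as follows, 

$$\chi^2 = \sum_{i=1}^{I} \sum_{j=1}^{J} \frac{(N_{ij} - n_{ij})^2}{n_{ij}}, \tag{A30a}$$

$$\nu = (I - 1)(J - 1), \tag{A30b}$$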



where $\nu$ is the number of degrees of freedom in the 
chi-square analysis. The degrees of freedom is determined by 
the number of rows, $I$, and the number of columns, $J$, in 
our contingency table (in our example $I = 10$, $J = 2$, so 
$\nu = 9$). One can compare the reduced form of this statistic, 
$\chi^2/\nu$, at a given significance level, $\alpha$, to computed values 
given in relevant texts or obtained using any statistical package 
[2^. Our example yields $\chi^2 = 322.46$, so that $\chi^2/\nu = 
35.83$. The critical value, which our reduced statistic 
exceeds, is $\chi^2_{crit}/\nu = 1.880$. The p-value 
for our observed reduced chi-square statistic is much less 
than 0.0001. This shows a significant association between 
treatment and score for the Magnetostatics topic on the 
BEMA post-test. 
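
The Magnetostatics example can be checked with a short Python sketch built on the observed counts of Table II. It assumes the usual expected count under no association, (row total x column total) / N, which reproduces the entries of Table III, and it assumes the unlabeled first row of Table II corresponds to zero correct answers. The variable and function names are hypothetical.

```python
# Observed counts from Table II: (M&I, traditional) for 0 through 9 correct answers.
observed = [
    (7, 45), (21, 155), (33, 224), (47, 227), (59, 195),
    (90, 142), (118, 113), (102, 72), (87, 56), (48, 17),
]


def chi_square_contingency(table):
    """Return (expected counts, chi-square statistic, degrees of freedom)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    # Expected count under no association: row total * column total / N.
    expected = [[r * c / total for c in col_totals] for r in row_totals]
    chi2 = sum(
        (obs - exp) ** 2 / exp
        for obs_row, exp_row in zip(table, expected)
        for obs, exp in zip(obs_row, exp_row)
    )
    dof = (len(table) - 1) * (len(col_totals) - 1)
    return expected, chi2, dof


expected, chi2, dof = chi_square_contingency(observed)
reduced = chi2 / dof   # reduced chi-square, to compare with the critical value
print(round(expected[0][0], 2), round(chi2, 2), dof, round(reduced, 2))
```

Comparing the reduced statistic with the tabulated critical value for nine degrees of freedom then gives the same conclusion as stated in the text.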

After performing this analysis, we found no association 
between treatment and score for the BEMA pre-test 
at Georgia Tech at the $\alpha = 0.05$ level (all p-values 
were moderate, $p > 0.20$). However, the BEMA post-test 
scores showed a significant association between score and 
treatment at the $\alpha = 0.05$ level. The higher mean 
values achieved by the M&I treatment indicate that the M&I 
course is more effective for all topics: Electrostatics ($p < 
0.001$), DC Circuits ($p < 0.001$), Magnetostatics ($p \ll 
0.0001$), and Faraday's Law ($p \ll 0.0001$). 